Meredith Brown
Last updated 2019-03-21
In this notebook, I explore the extent to which my city (and specific neighborhood) is plagued by rats and other animals, using 311 Daily Service Requests data from Somerville, MA. This data is available at:
https://data.somervillema.gov/311-Call-Center/311-Daily-Service-Requests/xs7t-pxkc
tl;dr: East Somerville does have more rats than West Somerville (but some areas are definitely worse than others).
I have also annotated this notebook with examples of best practices from the Tidyverse Style Guide, in purple, and my own style tips/tendencies, in green.
Setup
For instructions on how to set up the google maps API for use with ggmap, see this page.
knitr::opts_chunk$set(comment=NA, echo=FALSE, message=FALSE, warning=FALSE)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(readr)
library(ggmap)
# google maps API setup
#register_google(key = "redacted", write = TRUE)
Import 311 data from web
all_data <- read_csv(url("https://data.somervillema.gov/api/views/xs7t-pxkc/rows.csv"))
print(all_data, width=Inf, n=5)
# A tibble: 100,447 x 12
ticket_id issue_type ticket_created_date_time
<dbl> <chr> <chr>
1 400000 Temporary no parking sign posting 07/01/2015 12:26:00 AM
2 400006 Temporary no parking sign posting 07/01/2015 08:14:00 AM
3 400007 Temporary no parking sign posting 07/01/2015 08:20:00 AM
4 400009 Abandoned property 07/01/2015 08:23:00 AM
5 400010 Abandoned property 07/01/2015 08:23:00 AM
submitter issue_description ticket_status ticket_last_updated_date_time
<dbl> <chr> <chr> <chr>
1 10675 Traffic and Parking Closed 07/01/2015 12:27:00 AM
2 42994 Traffic and Parking Closed 07/03/2015 04:27:00 PM
3 42995 Traffic and Parking Closed 04/06/2016 02:23:00 PM
4 35749 DPW Highway Closed 07/03/2015 08:43:00 AM
5 42997 DPW Highway Closed 07/01/2015 02:55:00 PM
secondary_issue_type neighborhood_district ticket_closed_date_time
<chr> <chr> <chr>
1 Service Requests Ward 1 07/01/2015 12:27:00 AM
2 Service Requests <NA> 07/03/2015 04:27:00 PM
3 Service Requests Ward 3 04/06/2016 02:23:00 PM
4 Service Requests Ward 2 07/03/2015 08:43:00 AM
5 Service Requests Ward 3 07/01/2015 02:55:00 PM
location
<chr>
1 "47 Florence St\nSomerville, MA\n(42.3840872, -71.0826777)"
2 "Somerville MA, NA\n(0.0, 0.0)"
3 "2 Evergreen Sq\nSomerville, MA\n(42.3865038, -71.1133853)"
4 "36 Rogers Ave\nSomerville, MA\n(42.3966294, -71.1136228)"
5 "23 Porter St\nSomerville, MA\n(42.3874315, -71.1128601)"
street_address
<chr>
1 47 Florence St
2 <NA>
3 2 Evergreen Sq
4 36 Rogers Ave
5 23 Porter St
# … with 1.004e+05 more rows
Plotting rat-related work orders by neighborhood over time
I also include other animal-related issues for comparison, such as wild animal inquiries and injured or dead animal reports.
Interim Tidyverse Style Guide notes:
-
Snake case for variable (and function) names, rather than camel case. Better for readability!
-
Informative variable (and function) names! Always!
-
Use spaces like you would in normal written language: after (but not before) commas, not between parentheses and their contents, before/after =, +, ==, etc.
-
If a function call gets super long, separate into more than one line (ideally in a coherent way)
-
When using %>% to pipe, each pipe step should get its own indented line (except for one-step pipes, which can stay on one line)
-
Comments: Not excessive, focus on why rather than what or how
My additions:
-
When, for example, doing mutates on different fields, I try to only group them if they’re related, and otherwise have separate mutate lines for separate steps. For example, I group together the gsubs that lump together related issue types (first mutate), but separate those from the datetime mutate (second mutate). This makes the pipe logic more readable and easy to follow.
-
Notebook organization: Short blocks, one main point per block
month_data <- all_data %>%
mutate(issue_type = gsub("Wild animal/rabies questions", "Wild animal/rabies inquiry", issue_type),
issue_type = gsub("Injured animal", "Injured or dead animal", issue_type),
issue_type = gsub("Dead animal", "Injured or dead animal", issue_type)) %>%
mutate(ticket_created_date_time = mdy_hms(ticket_created_date_time, tz = "EST")) %>%
separate(col = "ticket_created_date_time",
into = c("year_ticket_created", "month_ticket_created", "day_ticket_created",
"hour_ticket_created", "minute_ticket_created", "second_ticket_created")) %>%
mutate(month_year = factor(paste(year_ticket_created, month_ticket_created, sep="-"))) %>%
group_by(month_year, neighborhood_district, issue_type) %>%
summarise(count=n()) %>%
mutate(freq = count / sum(count)) %>%
filter(issue_type == "Rats" | issue_type == "Wild animal/rabies inquiry" |
issue_type == "Injured or dead animal") %>%
mutate(issue_type = factor(issue_type, levels = c("Rats", "Injured or dead animal",
"Wild animal/rabies inquiry")))
Some ggplot-related style tips:
-
Informative axis and legend labels! Always!
-
Axis labels should be spaced so as to be readable; here, I chose every 3rd month both for readability and interpretability (each label roughly corresponds to a season).
-
When adding elements to ggplot, I also usually put each element on its own line, as with piping, for readability
-
I also somewhat prefer using names for colors rather than hex to make it more readable (and easy for me to manipulate later), but not always possible!
ggplot(month_data, aes(x = month_year, y = count, colour = issue_type,
group = issue_type)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x = "Month", y = "Number of issues reported", colour = "Issue type") +
scale_x_discrete(breaks = unique(month_data$month_year)[seq(1, length(unique(month_data$month_year)),
by = 3)]) +
geom_point() +
geom_line() +
scale_colour_manual(values = c("darkgoldenrod2", "firebrick2", "slateblue1")) +
facet_wrap(~neighborhood_district, nrow = 4)

From the above plot we see some seasonal effects: Rat call volume peaks during the summer months and wanes during colder weather. We also see that Wards 1-3 (East Somerville, Inman & Union Squares) are the most affected by rats, though there are more reports from the other wards in more recent months. Finally, injured and dead animal reports are remarkably similar to rat reports in volume, ward distribution, and seasonality.
The low volume of calls in Wards 4-7 (Central/West Somerville) raises some questions about whether residents of those neighborhoods are less likely overall to call 311 than residents of Wards 1-3. Below is a plot of total volume of 311 work orders by ward.
year_data <- all_data %>%
mutate(issue_type = gsub("Wild animal/rabies questions", "Wild animal/rabies inquiry", issue_type),
issue_type = gsub("Injured animal", "Injured or dead animal", issue_type),
issue_type = gsub("Dead animal", "Injured or dead animal", issue_type)) %>%
mutate(year_ticket_created = year(mdy_hms(ticket_created_date_time, tz = "EST"))) %>%
group_by(year_ticket_created, neighborhood_district) %>%
summarise(count=n()) %>%
mutate(neighborhood_district = ifelse(is.na(neighborhood_district), "NA", neighborhood_district)) %>%
mutate(neighborhood_district = factor(neighborhood_district, levels = c("Ward 1", "Ward 2", "Ward 3",
"Ward 4", "Ward 5", "Ward 6",
"Ward 7", "NA")))
ggplot(year_data, aes(x = year_ticket_created, y = count, colour = neighborhood_district,
group = neighborhood_district)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x = "Year", y = "Number of issues reported", colour = "Issue type") +
geom_point() +
geom_line() +
scale_colour_manual(values = c("#377EB8", "#4DAF4A", "#984EA3", "#FF7F00",
"#ffb90f", "#A65628", "#F781BF", "#E41A1C"))

General accessibility note:
There is indeed a very low volume of 311 work orders in Wards 4-7 until 2018. So if we want to look at the overall proportion of 311 work orders related to rats (relative to all 311 work orders) by neighborhood, it makes sense to focus on data from 2018 onward.
rat_year_data <- all_data %>%
mutate(neighborhood_district = ifelse(is.na(neighborhood_district), "NA", neighborhood_district)) %>%
mutate(neighborhood_district = factor(neighborhood_district, levels = c("Ward 1", "Ward 2", "Ward 3",
"Ward 4", "Ward 5", "Ward 6",
"Ward 7", "NA"))) %>%
mutate(year_ticket_created = year(mdy_hms(ticket_created_date_time, tz = "EST"))) %>%
filter(year_ticket_created >= 2018) %>%
group_by(neighborhood_district, issue_type) %>%
summarise(count=n()) %>%
mutate(freq = count / sum(count)) %>%
filter(issue_type == "Rats")
ggplot(rat_year_data, aes(x = neighborhood_district, y = freq,
colour = neighborhood_district)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x = "", y = "Proportion of rat issues reported (2018-19)") +
theme(legend.position = "none") +
geom_bar(fill = "white", stat = "identity") +
ggtitle("Relative proportion of rat-related issues out of all 311 work orders (2018-2019)") +
scale_colour_manual(values = c("#377EB8", "#4DAF4A", "#984EA3", "#FF7F00",
"#ffb90f", "#A65628", "#F781BF", "#E41A1C"))

Ward 1 has the highest relative proportion of rat-related issues out of all 311 work orders, followed by Wards 2 and 4.
ggplot(rat_year_data, aes(x = neighborhood_district, y = count,
colour = neighborhood_district)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x = "", y = "Number of rat issues reported (2018-19)") +
theme(legend.position = "none") +
geom_bar(fill = "white", stat = "identity") +
ggtitle("Total counts of rat-related issues out of all 311 work orders (2018-2019)") +
scale_colour_manual(values = c("#377EB8", "#4DAF4A", "#984EA3", "#FF7F00",
"#ffb90f", "#A65628", "#F781BF", "#E41A1C"))

In terms of raw counts, Wards 1 and 2 have much higher rat counts in 2018-2019 than the other neighborhoods. In general, West Somerville has the fewest rat incidents, followed by Central Somerville, and East Somerville has the most rat incidents.
Mapping rat-related incidents
Where are the rat hotspots???
rat_map_data <- all_data %>%
mutate(neighborhood_district = ifelse(is.na(neighborhood_district), "NA", neighborhood_district)) %>%
mutate(coords = sub("\\).*", "", sub(".*\\(", "", location))) %>%
separate(coords, c('lat', 'lon'), sep = ", ", convert = TRUE) %>%
filter(lat!=0 & lon!=0) %>%
mutate(ticket_created_date_time = mdy_hms(ticket_created_date_time, tz = "EST")) %>%
separate(col = "ticket_created_date_time",
into = c("year_ticket_created", "month_ticket_created", "day_ticket_created",
"hour_ticket_created", "minute_ticket_created", "second_ticket_created")) %>%
mutate(month_year = factor(paste(year_ticket_created, month_ticket_created, sep="-")))
# set center of map
center.lat = 42.392050
center.lon = -71.109106
# time map
qmap(c(lon = center.lon, lat = center.lat), source = "google", maptype = "roadmap", zoom=14) +
geom_point(aes(x = lon, y = lat, color = month_year), size = 2, alpha = 0.5,
data = rat_map_data %>% filter(issue_type == "Rats")) +
scale_colour_grey(start = .9, end = 0, guide = FALSE)

Darker points represent more recent incidents and lighter-colored points indicate less recent incidents. There are notable rat clusters in Union Square, Winter Hill near Foss Park, and Ten Hills.
qmap(c(lon = center.lon, lat = center.lat), source = "google", maptype = "roadmap", zoom = 14) +
geom_point(aes(x = lon, y = lat, color = neighborhood_district), size = 2, alpha = 0.5,
data = rat_map_data %>% filter(issue_type == "Rats")) +
scale_colour_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
"#FF7F00", "#ffb90f", "#A65628", "#F781BF")) +
theme(legend.title=element_blank())

This map plots rat reports by ward. I expected clearer delineation of wards than this, perhaps the neighborhood field corresponds to individuals’ home addresses rather than the location of the rat itself?
For reference, here’s where the wards are:
